{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "# VIMER-StrucTexT 1.0\n",
    "\n",
    "## 1.Introduction\n",
    "\n",
    "VIMER-StrucTexT is a joint segment-level and token-level representation enhancement model for document image understanding, such as pdf, invoice, receipt and so on. Due to the complexity of content and layout in visually rich documents (VRDs), structured text understanding has been a challenging task. Most existing studies decoupled this problem into two sub-tasks: entity labeling and entity linking, which require an entire understanding of the context of documents at both token and segment levels. VIMER-StrucTexT is a unified framework, which can flexibly and effectively deal with the above subtasks in their required representation granularity. Besides, we design a novel pre-training strategy with the self-supervised tasks, named **Segment Length Prediction** and **Paired Box Direction**, to learn a richer multi-modal representation.\n",
    "\n",
    "![structext_arch](./doc/structext_arch.png#pic_center)\n",
    "<center><b>Model Architecture</b></center>\n",
    "\n",
    "## 2.Model Effects\n",
    "\n",
    "We funtune VIMER-StrucTexT on three structured text understanding tasks, i.e, Token-based Entity Labeling (**T-ELB**), Segment-based Entity Labeling (**S-ELB**), and Entity Linking (**S-ELK**). **Note:** Below performance of all **ELB** tasks are the **entity-level F1 score** that calculate Macro F1 scores among related entity types.\n",
    "\n",
    "### 2.1 Token-based Entity Labeling\n",
    "\n",
    "#### 2.1.1 Datasets\n",
    "\n",
    "- [EPHOIE](https://github.com/HCIILAB/EPHOIE) is collected from scanned Chinese examination papers.\n",
    "There are 10 text types, which are marked at the character level, where means a text segment composed of characters with different categories.\n",
    "The required token-based entity types are as follow: ``` Subject, Test Time, Name, School, Examination Number, Seat Number, Class, Student Number, Grade, and Score. ```\n",
    "\n",
    "| Models                          | **entity-level F1 score**          |\n",
    "| :---------------------------- | :----------------------------: |\n",
    "| Base model      |            0.9884              |\n",
    "| Large model      |            0.9930              |\n",
    "\n",
    "### 2.2 Segment-based Entity Labeling\n",
    "\n",
    "#### 2.2.1 Datasets\n",
    "\n",
    "- [SROIE](https://rrc.cvc.uab.es/?ch=13&com=introduction) is a public dateset for receipt information extraction in ICDAR 2019 Chanllenge. It contains of 626 receipts for training and 347 receipts for testing. Every receipt contains four predefined values: `company, date, address, and total`.\n",
    "- [FUNSD](https://guillaumejaume.github.io/FUNSD/) is a form understanding benchmark with 199 real, fully annotated, scanned form images, such as marketing, advertising, and scientific reports, which is split into 149 training samples and 50 testing samples. FUNSD dataset is suitable for a variety of tasks and we force on the sub-tasks of **T-ELB** and **S-ELK** for the semantic entity `question, answer and header`.\n",
    "- [XFUND](https://github.com/doc-analysis/XFUND)  is a multilingual form understanding benchmark dataset that includes human-labeled forms with key-value pairs in 7 languages. Each language includes 199 forms, where the training set includes 149 forms, and the test set includes 50 forms. We evaluate on the Chinses section of XFUND.\n",
    "\n",
    "| Models       | **SROIE**    | **FUNSD**      | **XFUND-ZH**    |\n",
    "| :-----------------------------| :----------------------------: | :----------------------------: | :----------------------------: |\n",
    "| Base model      |           0.9827               |           0.8483              |\n",
    "| Large model     |           0.9870               |           0.8756              |\n",
    "\n",
    "### 2.3 Segment-based Entity Linking\n",
    "\n",
    "#### 2.3.1 Datasets\n",
    "\n",
    "- [FUNSD](https://guillaumejaume.github.io/FUNSD) annotates the links formatted as `(entity_from, entity_to)`, resulting in a question–answer pair. It guides the task of predicting the relations between semantic entities.\n",
    "- [XFUND](https://github.com/doc-analysis/XFUND) is the same setting as FUNSD. We evaluate on the Chinses section of the dataset.\n",
    "\n",
    "| Models       | **SROIE**    | **FUNSD**      | **XFUND-ZH**    |\n",
    "| :-----------------------------| :----------------------------: | :----------------------------: | :----------------------------: |\n",
    "| Base model      |           0.7045               |           0.830               |\n",
    "| Large model     |           0.7421               |           0.8681              |\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.Quick Experience\n",
    "\n",
    "### 3.1 Install PaddlePaddle\n",
    "\n",
    "This code base needs to be executed on the `PaddlePaddle 2.1.0+`. You can find how to prepare the environment from this [paddlepaddle-quick](https://www.paddlepaddle.org.cn/install/quick) or use pip:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {},
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "!pip3 install paddlepaddle-gpu --upgrade -i https://mirror.baidu.com/pypi/simple"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.2 Model Dependencies\n",
    "\n",
    "Dependencies are listed in file `requirements.txt`, you can use the following command to install these packages.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {},
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "!pip3 install --upgrade -r requirements.txt -i https://mirror.baidu.com/pypi/simple"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.3 Model Inference\n",
    "#### 3.3.1 Token-based ELB task on EPHOIE"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {},
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "! python data/make_ephoie_data.py --config_file ./configs/(base/large)/labeling_ephoie.json --label_file examples/ephoie/test_list.txt --label_dir <ephoie_folder>/final_release_image_20201222/ --kvpair_dir <ephoie_folder>/final_release_kvpair_20201222/ --out_dir <ephoie_folder>/test_labels\n",
    "! python ./tools/eval_infer.py --config_file ./configs/(base/large)/labeling_ephoie.json --task_type labeling_token --label_path <ephoie_folder>/test_labels/ --image_path <ephoie_folder>/final_release_image_20201222/ --weights_path StrucTexT_ephoie_(base/large)_labeling.pdparams"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3.3.2 Segment-based ELB task on FUNSD"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {},
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "! python data/make_funsd_data.py --config_file ./configs/base/labeling_funsd.json --label_dir <funsd_folder>/dataset/testing_data/annotations/ --out_dir <funsd_folder>/dataset/testing_data/test_labels\n",
    "! python ./tools/eval_infer.py --config_file ./configs/(base/large)/labeling_ephoie.json --task_type labeling_token --label_path <ephoie_folder>/test_labels/ --image_path <ephoie_folder>/final_release_image_20201222/ --weights_path StrucTexT_ephoie_(base/large)_labeling.pdparams"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3.3.3 Segment-based ELK task on FUNSD"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {},
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "! python data/make_funsd_data.py --config_file ./configs/base/linking_funsd.json --label_dir <funsd_folder>/dataset/testing_data/annotations/ --out_dir <funsd_folder>/dataset/testing_data/test_labels\n",
    "! python ./tools/eval_infer.py --config_file ./configs/base/linking_funsd.json --task_type linking --label_path <funsd_folder>/dataset/testing_data/test_labels/ --image_path <funsd_folder>/dataset/testing_data/images/ --weights_path StrucTexT_funsd_base_linking.pdparams"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4.Product Applications\n",
    "\n",
    "The following visualized images are sampled from the domestic practical applications of StrucTexT. *Different colors of masks represent different entity categories. There are black lines between different segments or text lines, indicating that they belong to the same entity. And the orange lines indicate the link relationship between entities.*\n",
    "- Shopping receipts\n",
    "![example_receipt](./doc/receipt_vis.png#pic_center)\n",
    "- Bus/Ship receipts\n",
    "![example_busticket](./doc/busticket_vis.png#pic_center)\n",
    "- Printed receipts\n",
    "![example_print](./doc/print_vis.png#pic_center)\n",
    "- For more information and applications, please visit [Baidu OCR](https://ai.baidu.com/tech/ocr) open platform."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5.Citation\n",
    "You can cite our paper as below:\n",
    "```\n",
    "@inproceedings{li2021structext,\n",
    "  title={StrucTexT: Structured Text Understanding with Multi-Modal Transformers},\n",
    "  author={Li, Yulin and Qian, Yuxi and Yu, Yuechen and Qin, Xiameng and Zhang, Chengquan and Liu, Yan and Yao, Kun and Han, Junyu and Liu, Jingtuo and Ding, Errui},\n",
    "  booktitle={Proceedings of the 29th ACM International Conference on Multimedia},\n",
    "  pages={1912--1920},\n",
    "  year={2021}\n",
    "}\n",
    "```\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
